GapminderData <- read_csv(file = Gapminder_Filelink, show_col_types = FALSE) %>% # show_col_types belongs to read_csv()
  select(-`...1`) # drop the unnamed row-index column; read_csv() already returns a tibble
What we see here is the Gapminder dataset (although it is labelled as cleaned, it is not fully clean). The dataset records various metrics, from economic to agricultural, that describe individual countries over time.
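One quick way to check how clean a table really is, is to count the missing values in each column. The sketch below uses a small made-up data frame (`toy`, with hypothetical column names and values), not the Gapminder file itself.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  Year = c(1962, 1962, 1967, 1967),
  gdpPercap = c(1200, NA, 1500, 9800),
  co2 = c(0.4, NA, NA, 6.1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Share of rows with at least one missing value
share_incomplete <- mean(!complete.cases(toy))
print(share_incomplete)
```

The same two calls could be applied to `GapminderData` to see which columns drive the "Removed ... rows" warnings in the plots.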
filtered_year <- 1962
GapminderFilteredYear <- GapminderData %>%
dplyr::filter(Year == filtered_year)
ScatterPlot <- GapminderFilteredYear %>%
ggplot(., aes(
x = gdpPercap,
y = `CO2 emissions (metric tons per capita)`
)) +
geom_point() +
theme_classic()
## Warning: Removed 151 rows containing missing values (geom_point).
From our dataset, we can see that there is a positive linear relationship between CO2 emissions and GDP per capita. Now let's investigate how strong that relationship is, using the Pearson correlation coefficient (R).
test <- "pearson"
rm_na <- "complete.obs"
pearson_corr <- cor(GapminderFilteredYear$`CO2 emissions (metric tons per capita)`,
GapminderFilteredYear$gdpPercap,
method = test,
use = rm_na
) * 100
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 151 rows containing non-finite values (stat_smooth).
## Warning: Removed 151 rows containing non-finite values (stat_cor).
## Warning: Removed 151 rows containing missing values (geom_point).
## The pearson correlation coefficient between CO2 emissions and GDP per capita is 92.6%
From what we can see here, the Pearson correlation coefficient is approximately 0.926 (92.61% when expressed as a percentage), meaning that there is a strong positive correlation between CO2 emissions and GDP per capita across countries in 1962. In addition, the p-value (2.2 * 10^-6) is less than 0.05, meaning that the correlation between the two variables is statistically significant. Now let's take a look at all years and see which has the highest Pearson correlation coefficient.
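The coefficient and its p-value can both be read from a single call to `cor.test()`. The following is a minimal sketch on synthetic data (the vectors `gdp` and `co2` are made up for illustration and are not Gapminder values); it is not necessarily how the figures above were produced.

```r
set.seed(1)
# Synthetic, positively related variables (not Gapminder values)
gdp <- runif(50, min = 500, max = 20000)
co2 <- 0.0005 * gdp + rnorm(50, sd = 1)

# cor.test() returns the estimate, its p-value, and a confidence interval
result <- cor.test(co2, gdp, method = "pearson")

r <- unname(result$estimate) # Pearson's r, between -1 and 1
p <- result$p.value          # p-value for H0: the true correlation is 0

cat(sprintf("r = %.3f, p = %.3g\n", r, p))
```

The estimate matches what `cor()` returns for the same vectors; `cor.test()` simply adds the inference on top.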
test <- "pearson"
rm_na <- "complete.obs"
GapminderYear <- GapminderData %>% # Unique years, as characters so they can name the result
  distinct(Year) %>%
  pull() %>%
  as.character()
PearsonCorrYears <- sapply(
  GapminderYear,
  FUN = function(year) {
    # Rows for this year only (the name is a character, so convert it back)
    GapminderByYear <- GapminderData %>%
      filter(Year == as.numeric(year))
    # Pearson correlation coefficient for this year, ignoring incomplete pairs
    cor(
      x = GapminderByYear$`CO2 emissions (metric tons per capita)`,
      y = GapminderByYear$gdpPercap,
      method = test,
      use = rm_na
    )
  },
  simplify = FALSE,
  USE.NAMES = TRUE # Keep the years as names
) %>% unlist() # Named numeric vector: one coefficient per year
After iterating over the years in the Gapminder dataset, we can see that the highest Pearson correlation coefficient occurs in 1967, so that year has the strongest correlation (93.88%) between CO2 emissions and GDP per capita. Now let's filter the Gapminder dataset again for that year and draw a scatterplot with plotly.
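The per-year iteration above can also be expressed as a split-apply step. The sketch below uses base R only and a synthetic panel (`toy`, with made-up values and a shortened `co2` column name), so the numbers it prints are not the Gapminder results.

```r
set.seed(42)
# Synthetic panel: 3 years x 30 countries (values are made up)
toy <- data.frame(Year = rep(c(1962, 1967, 1972), each = 30))
toy$gdpPercap <- runif(nrow(toy), 500, 20000)
toy$co2 <- 0.0005 * toy$gdpPercap + rnorm(nrow(toy))
toy$co2[c(3, 40)] <- NA # a few missing values, as in real data

# Split the rows by year, then compute one correlation per piece
corr_by_year <- sapply(
  split(toy, toy$Year),
  function(d) cor(d$co2, d$gdpPercap, method = "pearson", use = "complete.obs")
)

# The names are the years, so which.max() identifies the strongest year
best_year <- as.numeric(names(which.max(corr_by_year)))
print(corr_by_year)
print(best_year)
```

Because `split()` names each piece by its year, the result is already a named vector and needs no separate character-to-numeric bookkeeping inside the loop.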
PearsonCorrMaxYear <- PearsonCorrYears[which.max(PearsonCorrYears)] %>%
names() %>%
as.numeric() # Finding the max year for the analysis
GapminderFilteredMax <- GapminderData %>% ## Filter by year with the highest Pearson
filter(Year == PearsonCorrMaxYear) ## correlation coefficient of CO2 and GDP
GapminderMaxplot <- GapminderFilteredMax %>% ## ggplot implementation
ggplot(., aes(
x = `CO2 emissions (metric tons per capita)`,
y = gdpPercap,
size = pop
)) +
geom_point() +
theme_classic()
Here is the plotly implementation of the scatterplot for 1967, the year with the highest correlation. The scatterplot is interactive: hovering over a point shows its values (gdpPercap, CO2 emissions).
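The conversion step itself is not shown in this section. One common way to obtain an interactive version of a ggplot object is `plotly::ggplotly()`; the sketch below assumes that approach and uses a made-up data frame (`toy`) rather than the filtered Gapminder rows.

```r
library(ggplot2)
library(plotly)

# Toy data standing in for the filtered Gapminder rows (values are made up)
toy <- data.frame(
  co2 = c(0.3, 1.2, 5.4, 9.8),
  gdpPercap = c(800, 2500, 9000, 16000),
  pop = c(5e6, 2e7, 5e7, 1.8e8)
)

p <- ggplot(toy, aes(x = co2, y = gdpPercap, size = pop)) +
  geom_point() +
  theme_classic()

# ggplotly() converts a ggplot object into an interactive plotly widget;
# hovering over a point shows the mapped values
interactive_plot <- ggplotly(p)
# print(interactive_plot) displays it in the viewer or in a knitted document
```

Under this assumption, the equivalent call for the plot built above would be `ggplotly(GapminderMaxplot)`.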
GapminderContinentEnergyUse <- GapminderData %>%
select(continent, `Energy use (kg of oil equivalent per capita)`) %>%
na.omit()
To determine the relationship between continent (a categorical variable) and energy use (kg of oil equivalent per capita, a continuous variable), we fit a linear model with continent as the only predictor, which is equivalent to a one-way ANOVA across the continent groups.
lm(
formula = `Energy use (kg of oil equivalent per capita)` ~ continent,
data = GapminderContinentEnergyUse
) %>% summary()
##
## Call:
## lm(formula = `Energy use (kg of oil equivalent per capita)` ~
## continent, data = GapminderContinentEnergyUse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2796.0 -1107.5 -349.1 276.8 12904.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.5 137.2 5.090 4.42e-07 ***
## continentAmericas 1005.1 196.9 5.105 4.10e-07 ***
## continentAsia 1168.8 197.7 5.911 4.93e-09 ***
## continentEurope 2447.5 183.0 13.377 < 2e-16 ***
## continentOceania 3281.8 454.1 7.227 1.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1936 on 843 degrees of freedom
## Multiple R-squared: 0.1963, Adjusted R-squared: 0.1924
## F-statistic: 51.46 on 4 and 843 DF, p-value: < 2.2e-16
From the results above, we have the following equation:
$$Y = \text{Intercept} + \text{continentAmericas} \cdot x_1 + \text{continentAsia} \cdot x_2 + \text{continentEurope} \cdot x_3 + \text{continentOceania} \cdot x_4$$
where:
Y = Energy use (kg of oil equivalent per capita) (continuous variable)
x1 = North and South America (categorical indicator, i.e. 0 or 1)
x2 = Asia (categorical indicator, i.e. 0 or 1)
x3 = Europe (categorical indicator, i.e. 0 or 1)
x4 = Oceania (categorical indicator, i.e. 0 or 1)
continentAmericas = 1005.1
continentAsia = 1168.8
continentEurope = 2447.5
continentOceania = 3281.8
Intercept = 698.5
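Since the p-value for every continent coefficient is less than 0.05, we can reject the null hypothesis for each one: every listed continent differs significantly from the reference continent, which is the level absent from the coefficient table and absorbed into the intercept. The coefficients are differences from that baseline rather than slopes, so the positive signs mean that each listed continent has a higher mean energy use (kg of oil equivalent per capita) than the reference continent. The overall F-test (p-value < 2.2e-16) indicates that mean energy use differs across continents.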
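To make the reading of the coefficients concrete: with a single categorical predictor, the intercept is the mean of the reference group and every other coefficient is a difference from it. By that logic the fitted mean for Europe in the output above is 698.5 + 2447.5 = 3146.0. The sketch below checks the same property on a made-up three-group data frame (`toy`, with hypothetical groups A, B, and C).

```r
# Toy data: three groups with different means (values are made up)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  y = c(1, 2, 3, 2, 5, 6, 7, 6, 9, 10, 11, 10)
)

fit <- lm(y ~ group, data = toy)
coefs <- coef(fit)

# The intercept equals the mean of the reference level ("A"), and each
# other coefficient is that group's mean minus the reference mean
group_means <- tapply(toy$y, toy$group, mean)
print(coefs)
print(group_means)

# One overall F-test across all groups (one-way ANOVA)
print(anova(fit))
```

Here the group means are 2, 6, and 10, so the coefficients are 2 (intercept), 4 (`groupB`), and 8 (`groupC`).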
continents_of_interest <- c("Europe", "Asia")
variables_of_interest <- c("continent", "Year", "Imports of goods and services (% of GDP)")
Euro_Asia_Imports <- GapminderData %>%
dplyr::filter(continent %in% continents_of_interest &
Year > 1990) %>%
dplyr::select(all_of(variables_of_interest)) %>% # all_of() makes the external vector explicit
na.omit()
ggboxplot(Euro_Asia_Imports,
x = "continent",
y = "Imports of goods and services (% of GDP)",
color = "continent",
add = "jitter",
shape = "continent"
)
To determine whether there is a significant difference between Europe and Asia in Imports of goods and services (% of GDP) after 1990, we first need to decide between Student's t-test, which assumes equal variances in the two groups, and the Welch t-test, which does not.
The boxplot above shows that the spread of the data differs between Asia and Europe, so the Welch t-test is the more appropriate choice. It is also the default in R's t.test().
t.test(`Imports of goods and services (% of GDP)` ~ continent,
data = Euro_Asia_Imports
)
##
## Welch Two Sample t-test
##
## data: Imports of goods and services (% of GDP) by continent
## t = 1.3552, df = 137.53, p-value = 0.1776
## alternative hypothesis: true difference in means between group Asia and group Europe is not equal to 0
## 95 percent confidence interval:
## -2.321099 12.433240
## sample estimates:
## mean in group Asia mean in group Europe
## 46.84531 41.78924
Since the p-value (0.1776) > 0.05, we fail to reject the null hypothesis. We can conclude that there is no statistically significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990.
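The choice between the two tests comes down to one argument of `t.test()`. The sketch below uses synthetic data (a made-up `toy` data frame with an `imports` column) to compare the group spreads and then run both variants; it illustrates the mechanics only and does not reproduce the result above.

```r
set.seed(7)
# Two synthetic groups with equal means but unequal spread (made-up values)
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 40),
  imports = c(rnorm(40, mean = 45, sd = 25), rnorm(40, mean = 45, sd = 10))
)

# Compare the spread of the two groups before choosing a test
print(tapply(toy$imports, toy$continent, sd))

# Default: Welch's t-test (var.equal = FALSE), no equal-variance assumption
welch <- t.test(imports ~ continent, data = toy)

# Student's t-test, which pools the two variances
student <- t.test(imports ~ continent, data = toy, var.equal = TRUE)

print(welch$method)
print(c(
  welch_df = unname(welch$parameter),
  student_df = unname(student$parameter)
))
```

The Welch version adjusts the degrees of freedom downward when the variances differ, which is why the output above reports a non-integer df (137.53).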
order_of_variables <- c(
"Year",
"Country Name",
"Population density (people per sq. km of land area)"
)
GapminderPopDensity <- GapminderData %>%
select(
`Country Name`,
`Population density (people per sq. km of land area)`,
Year
) %>%
na.omit() %>%
dplyr::group_by(Year) %>%
slice(which.max(`Population density (people per sq. km of land area)`)) %>%
dplyr::select(all_of(order_of_variables)) # all_of() makes the external vector explicit
The datatable below shows which country has the highest population density in each year of the dataset. From what we see, Monaco and Macao SAR, China have held the top spot throughout 1962-2007.
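The "top row per year" step can also be written with `dplyr::slice_max()`. The following is a sketch under assumptions: the data frame `toy` and its `country` and `density` columns are made up for illustration and stand in for the Gapminder columns.

```r
library(dplyr)

# Toy data with made-up densities (not Gapminder values)
toy <- data.frame(
  Year = c(1962, 1962, 1962, 1967, 1967, 1967),
  country = c("A", "B", "C", "A", "B", "C"),
  density = c(100, 250, NA, 300, 120, 90)
)

densest_by_year <- toy %>%
  filter(!is.na(density)) %>% # drop rows without a density value
  group_by(Year) %>%
  slice_max(density, n = 1, with_ties = FALSE) %>% # top row within each year
  ungroup() %>%
  select(Year, country, density)

print(densest_by_year)
```

Setting `with_ties = FALSE` guarantees exactly one row per year, which matches the behaviour of `which.max()` in the code above.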